library(ggplot2)
library(plotly)
library(MASS)
library(pls)
library(leaps)

set.seed(2112)

knitr::opts_chunk$set(
  warning = FALSE, 
  message = FALSE
)
options(browser = 'chromium')

Task 2

We are to investigate the effects of noise on PCA, using the iris dataset.

noisy_pca = function(X, labels, noise_n, noise_sd) {
  # Append noise_n columns of N(0, noise_sd^2) noise to X,
  # then plot the first two principal components, colored by labels.
  n_obs = dim(X)[1]
  noise = matrix(rnorm(noise_n * n_obs, 0, noise_sd), n_obs, noise_n)
  pX = cbind(X, noise)

  pca = prcomp(pX)
  ggplot() +
    geom_point(aes(x = pca$x[,1], y = pca$x[,2], color = labels))
}

First, iris without noise:

noisy_pca(iris[,1:4], iris[,5], 0, 1)

Then, iris with 100 columns worth of \(\mathcal{N}(0, 1)\) noise:

noisy_pca(iris[,1:4], iris[,5], 100, 1)

We can see that the noisy columns have made the separation between the different types far less clear.

Next, iris with 100 columns of \(\mathcal{N}(0, 0.1^2)\) noise:

noisy_pca(iris[,1:4], iris[,5], 100, 0.1)

and with \(\mathcal{N}(0, 4^2)\) noise:

noisy_pca(iris[,1:4], iris[,5], 100, 4)

For \(\sigma = 0.1\) the change is fairly small, whereas for \(\sigma = 4\) the change is significant. Let’s look at the variances of the original columns:

apply(iris[,1:4], 2, var)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.6856935    0.1899794    3.1162779    0.5810063

We can see that adding “features” with \(\sigma = 4\) would necessarily distort the PCA, since PCA projects onto the directions of largest variance, and the largest variance among the original columns is only about 3.1.
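As a rough check (my own addition, not part of the original analysis), we can compare the total variance of the four iris measurements with the expected total variance contributed by 100 noise columns at each noise level:

```r
# total variance of the four original measurements vs. the (expected)
# total variance added by 100 noise columns at each noise level
signal_var = sum(apply(iris[, 1:4], 2, var))  # ~4.57
c(
  signal      = signal_var,
  noise_sd0.1 = 100 * 0.1^2,  # 1: comparable to the signal
  noise_sd4   = 100 * 4^2     # 1600: dwarfs the signal entirely
)
```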

Task 3

In this task we investigate the effect of outliers on PCA.

First, we want to plot a sample from \((4X, 2Y, Z)\) where \(X, Y, Z \sim \mathcal{N}(0, 1)\) (presumably just an ellipsoid-shaped point cloud stretched along the \(x\)-axis):

n = 100
# 100 draws from a standard trivariate normal, one observation per row
p = matrix(rnorm(3 * n), n, 3)

p_df = as.data.frame(p)
colnames(p_df) = c("X", "Y", "Z")

p_df[,"X"] = 4 * p_df[,"X"]
p_df[,"Y"] = 2 * p_df[,"Y"]

fig = plot_ly(p_df, x = ~X, y = ~Y, z = ~Z)
fig = fig %>% add_markers()
fig = fig %>% layout(
  scene=list(
    yaxis = list(scaleanchor="x"),
    zaxis = list(scaleanchor="x")
  )
)
fig

Let’s look at the directions:

prcomp(p_df)$rotation
##           PC1       PC2         PC3
## X  0.99962726 0.0206555 -0.01785165
## Y -0.02279905 0.9912948 -0.12967161
## Z  0.01501782 0.1300303  0.99139628

Nothing particularly interesting. And now let’s look at the same when we replace the last observation with \((40, 40, 40)\).

p_df[dim(p_df)[1],] = c(40, 40, 40)
prcomp(p_df)$rotation
##         PC1        PC2        PC3
## X 0.6773874 -0.7283802  0.1029977
## Y 0.5310666  0.5810908  0.6166861
## Z 0.5090330  0.3630368 -0.7804420

We can see that the first principal direction shifted markedly towards \((1, 1, 1)\).
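As a quick sanity check (a sketch of my own, assuming `p_df` from above still contains the outlier), we can measure how closely PC1 now aligns with the direction \((1, 1, 1)\):

```r
# cosine similarity between PC1 and the unit vector along (1, 1, 1);
# values near 1 mean PC1 points almost exactly at the outlier's direction
pc1 = prcomp(p_df)$rotation[, 1]
abs(sum(pc1 / sqrt(3)))  # ~0.99 for the rotation printed above
```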

Task 4

In this task we fit principal component regression (PCR) on the biopsy dataset from MASS, predicting V2 from the remaining predictors:

# drop the ID and the class label, keep only complete rows
biopsy[,"ID"] = NULL
biopsy[,"class"] = NULL
biopsy = biopsy[complete.cases(biopsy),]
model = pcr(V2 ~ ., data = biopsy, validation = "CV")

Let’s look at the summary:

summary(model)
## Data:    X dimension: 683 8 
##  Y dimension: 683 1
## Fit method: svdpc
## Number of components considered: 8
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## CV           3.067    1.414    1.356    1.360    1.343    1.323    1.221    1.200    1.182
## adjCV        3.067    1.413    1.355    1.359    1.342    1.337    1.219    1.199    1.179
## 
## TRAINING: % variance explained
##     1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X     67.04    74.99    81.96    86.88    90.92    94.67    97.48    100.0
## V2    78.83    80.55    80.58    81.05    81.75    84.83    85.48     86.1

So, 3 principal components explain \(\geq 80\%\) of the data’s variance, while only 2 are needed to explain \(\geq 80\%\) of V2’s variance.
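We can cross-check the X row against pls’s `explvar` helper (a sketch of my own; `explvar` returns the per-component percentage of predictor variance, so its cumulative sum should reproduce the X row of the summary):

```r
# cumulative % of predictor (X) variance explained per component;
# should match the "X" row in the TRAINING section of summary(model)
cumsum(explvar(model))
```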

Let’s look at which variables contribute the most to the first principal component:

pca1 = model$loadings[,1]
sort(pca1)  # named vector; sort() keeps the names
##         V6         V3         V8         V4         V1         V7         V5         V9 
## -0.4927208 -0.4202931 -0.3878608 -0.3635490 -0.3257396 -0.3182930 -0.2687934 -0.1353124

Regarding the coefficients, from what I understand this should be sufficient:

model$Yloadings
## 
## Loadings:
##    Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 Comp 7 Comp 8
## V2 -0.423  0.182         0.121  0.162 -0.354  0.188  0.193
## 
##                Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 Comp 7 Comp 8
## SS loadings     0.179  0.033  0.001  0.015  0.026  0.126  0.035  0.037
## Proportion Var  0.179  0.033  0.001  0.015  0.026  0.126  0.035  0.037
## Cumulative Var  0.179  0.212  0.213  0.228  0.254  0.379  0.415  0.452

So with one PC we’d have \(\beta = -0.423\), and with two \(\beta = [-0.423, 0.182]^T\).
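Alternatively (my addition, not required by the task), pls can map these back to coefficients on the original predictor scale via `coef()`; e.g. for the two-component model:

```r
# regression coefficients for the 8 original predictors, using 2 PCs
coef(model, ncomp = 2)
```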

Now we shall check which number of PCs minimizes the cross-validated error (here MSEP) [apologies for the horrendous plot, but I couldn’t be bothered to figure out how to do it in ggplot2]:

validationplot(model, val.type = "MSEP")
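For the record, here is one possible ggplot2 equivalent (a sketch, assuming `model` from above): extract the cross-validated MSEP per number of components via pls’s `MSEP` and plot it manually.

```r
msep = MSEP(model, estimate = "CV")
df = data.frame(
  ncomp = msep$comps,   # number of components (0 = intercept-only model)
  MSEP  = drop(msep$val)
)
ggplot(df, aes(x = ncomp, y = MSEP)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of components", y = "CV MSEP")
```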

Finally, we shall investigate whether the predictors most impactful on PC1 are the same as the ones selected by regsubsets:

rs = regsubsets(V2 ~ ., data = biopsy, nvmax = 8)
x = summary(rs)
x$which
##   (Intercept)    V1   V3    V4    V5    V6    V7    V8    V9
## 1        TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 2        TRUE FALSE TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
## 3        TRUE FALSE TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE
## 4        TRUE FALSE TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE
## 5        TRUE  TRUE TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE
## 6        TRUE  TRUE TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## 7        TRUE  TRUE TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## 8        TRUE  TRUE TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

We can see that the variables are selected in the order V3, V5, V7, V4, V1, V8, V9, V6. So (unless I did the procedure incorrectly) it’s not at all what the PCA loadings would suggest: notably V6, which has the largest absolute PC1 loading, enters last.
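For a direct comparison (a sketch of my own, assuming `model` from above), we can order the predictors by the magnitude of their PC1 loadings:

```r
# predictors ordered by |PC1 loading|, largest first;
# per the loadings printed above: V6, V3, V8, V4, V1, V7, V5, V9
names(sort(abs(model$loadings[, 1]), decreasing = TRUE))
```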